Introduction

Prosper is the country’s first peer-to-peer lending marketplace. The company has provided more than $2 billion in loans. Loan interest rates range from 5.99% for the most creditworthy borrowers to 36.00% APR for consumers with lower credit ratings. Borrowers can obtain loans from $2,000 up to $35,000.

Best for: people looking to refinance debt, individuals starting a business, consumers facing financial hardship, and those looking to finance a major life event. Source: https://www.consumeraffairs.com/finance/prosper.html

As an aspiring data scientist, I will explore the borrower data to learn about borrower behavior, the demographic segmentation of loans, and Prosper’s performance in terms of listing volume by year and by area.

Data Set Option

Loan Data from Prosper: Last update: 03/11/2014. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

To learn more about the data, see the variable dictionary, which explains each variable in the data set: https://goo.gl/m9hNi4

Research Questions

Analysis

Required packages: the following packages are required for this EDA. To install them, please execute the following code.

Retrieve or set the dimension of an object

This data set contains 113,937 loans with 81 variables.

## [1] 113937     81

The data set includes different kinds of variables, such as loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The variables are of different types. To get an overview of them, I use the summary function, a generic function that produces a summary of its argument; for a data frame it summarizes every column.

##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6          :84984   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   C      : 5649   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   D      : 5153   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   B      : 4389   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   AA     : 3509   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   HR     : 3508   Max.   :60.00  
##  (Other)                      :113912   (Other): 6745                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576                      :58848  
##  Completed            :38074   2014-03-04 00:00:00:  105  
##  Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other)              : 1108   (Other)            :54633  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000                  :29084         Min.   : 1.00  
##  1st Qu.:3.000           C      :18345         1st Qu.: 4.00  
##  Median :4.000           B      :15581         Median : 6.00  
##  Mean   :4.072           A      :14551         Mean   : 5.95  
##  3rd Qu.:5.000           D      :14274         3rd Qu.: 8.00  
##  Max.   :7.000           E      : 9795         Max.   :11.00  
##  NA's   :29084           (Other):12307         NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           False:56459         False:101218    
##  1st Qu.: 26.00           True :57478         True : 12719    
##  Median : 67.00                                               
##  Mean   : 96.07                                               
##  3rd Qu.:137.00                                               
##  Max.   :755.00                                               
##  NA's   :7625                                                 
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 
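The overview above is the kind of output that dim and summary produce. The following is a minimal sketch on a small, made-up data frame (the values and the three columns chosen are assumptions for illustration, not rows from the real file), showing how the two functions treat factor and numeric columns:

```r
# Hypothetical miniature of the loan data; the real file has
# 113,937 rows and 81 columns.
loans <- data.frame(
  LoanStatus = factor(c("Current", "Completed", "Current", "Chargedoff")),
  LoanOriginalAmount = c(4000, 6500, 12000, 1000),
  BorrowerAPR = c(0.156, NA, 0.284, 0.210)
)

# dim() returns c(number of rows, number of columns).
loan_dim <- dim(loans)
print(loan_dim)

# summary() dispatches on the class of its argument. For a data frame it
# reports level counts for factors, and quartiles plus NA counts for
# numeric columns.
print(summary(loans))

# The type of each column can also be checked directly.
col_types <- sapply(loans, class)
print(col_types)
```

In the real output, columns such as ListingKey or LoanStatus are summarized as level counts because they were read as factors, while columns such as BorrowerAPR show quartiles and an NA count.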

Univariate Plots Section

In this section, I look for geographic and seasonal patterns in borrower behavior. To do that, I first need to create some date-related columns.
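One possible way to derive the date columns is sketched below in base R. The three timestamps are made-up examples in the same format as ListingCreationDate in the summary above, and the approach (cutting the date prefix and parsing it with as.Date) is an assumption, not necessarily how the original columns were built:

```r
# Hypothetical sample of listing timestamps in the data set's format.
loans <- data.frame(
  ListingCreationDate = c("2013-10-02 17:20:16.550000000",
                          "2009-07-15 08:01:02.000000000",
                          "2006-12-31 23:59:59.123000000"),
  stringsAsFactors = FALSE
)

# Keep only the "YYYY-MM-DD" prefix, then parse it as a Date.
# as.character() makes this work even if the column was read as a factor.
listing_date <- as.Date(substr(as.character(loans$ListingCreationDate), 1, 10))

# Extract year, month, and day as integers.
loans$ListingCreationYear  <- as.integer(format(listing_date, "%Y"))
loans$ListingCreationMonth <- as.integer(format(listing_date, "%m"))
loans$ListingCreationDay   <- as.integer(format(listing_date, "%d"))

print(loans[, c("ListingCreationYear", "ListingCreationMonth",
                "ListingCreationDay")])
```

Storing the parts as integers keeps them sortable and easy to use as grouping variables in later counts and plots.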

Listing by Geography Location

Listing By Borrower State

As we can observe, loans are spread across the whole country. This could mean that the company has a network of offices distributed around the country, or that it offers a website that is known and accessible everywhere in the country. It could also mean that the company has a good reputation and inspires confidence, so people across the country demand its services.

Group data by BorrowerState

Even though loans are spread across the country, some states have more loans than others. The table and the graph below show the five states with the highest number of loans. In descending order, these states are California, Texas, New York, Florida, and Illinois.

BorrowerState  state.name  LoanOriginalAmount_mean  LoanOriginalAmount_median      n
CA             california                 8974.326                       7000  14717
TX             texas                      9087.853                       7500   6842
NY             new york                   8833.034                       7000   6729
FL             florida                    8207.461                       6500   6720
IL             illinois                   8395.931                       6500   5921

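Although the number of loans is important, the amount of money behind those loans matters more. So, let’s rank the same five states by the size of their loans. Ordered by mean loan amount, they are: Texas, California, New York, Illinois, and Florida. Note that this ranks the average loan, not the total amount lent: multiplying the mean by the number of loans, California (14,717 loans) still accounts for roughly twice the total amount of Texas (6,842 loans).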

BorrowerState  state.name  LoanOriginalAmount_mean  LoanOriginalAmount_median      n
TX             texas                      9087.853                       7500   6842
CA             california                 8974.326                       7000  14717
NY             new york                   8833.034                       7000   6729
IL             illinois                   8395.931                       6500   5921
FL             florida                    8207.461                       6500   6720
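The two tables above contain the same per-state statistics sorted in different ways. A minimal base R sketch of that kind of grouping is shown below, using made-up loans. The `total` column and the three sort orders are additions for illustration and do not appear in the original tables:

```r
# Hypothetical loans; amounts are invented for illustration.
loans <- data.frame(
  BorrowerState = c("CA", "CA", "CA", "TX", "TX", "NY"),
  LoanOriginalAmount = c(4000, 7000, 10000, 7500, 10500, 6000)
)

# Compute several statistics per state in one pass.
by_state <- aggregate(
  LoanOriginalAmount ~ BorrowerState, data = loans,
  FUN = function(x) c(mean = mean(x), median = median(x),
                      n = length(x), total = sum(x))
)

# aggregate() stores the statistics as a single matrix column;
# flatten it into ordinary columns named mean, median, n, total.
by_state <- data.frame(BorrowerState = by_state$BorrowerState,
                       by_state$LoanOriginalAmount)

# The same table sorted three ways, largest first.
by_count <- by_state[order(-by_state$n), ]      # number of loans
by_mean  <- by_state[order(-by_state$mean), ]   # average loan size
by_total <- by_state[order(-by_state$total), ]  # total amount lent

print(by_count)
print(by_mean)
print(by_total)
```

In this toy data, as in the real tables, the state with the highest mean loan amount (TX) is not the state with the most loans or the largest total (CA), which is why the sort key should be stated alongside each ranking.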

Listing creation in prosper by time

Listing Creation Date - Year

Now that I have an idea of the geographic segmentation of loans, I want to see how they evolved over time. In the first graph below, I can observe that in two years (2006 and 2009) the number of loans was very low. Probably 2006 was the company’s first year of operation, and in 2009 there was some problem in the economy. In the other years the number of loans grew, probably because of the economic recovery. The other two graphs show that the amount of money requested per loan stayed roughly constant.

Listing Creation Date - Month

Even though some years have more loans than others (2013 in particular), I cannot observe a seasonal pattern by month within a year. This could mean that people do not borrow for season-specific reasons such as holidays.
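Because the yearly volumes differ so much, one way to look for seasonality without plots is to compare each month's share of its year's listings rather than raw counts. This is a sketch on made-up data, assuming the ListingCreationYear and ListingCreationMonth columns hold integers. It is one possible check, not the method used for the graphs in this report:

```r
# Hypothetical listings, one row per listing.
loans <- data.frame(
  ListingCreationYear  = c(2012, 2012, 2012, 2013, 2013, 2013, 2013),
  ListingCreationMonth = c(1, 1, 6, 1, 6, 6, 12)
)

# Fix the month levels so months without listings appear as zero counts.
month <- factor(loans$ListingCreationMonth, levels = 1:12)

# Cross-tabulate listings by year (rows) and month (columns).
counts <- table(Year = loans$ListingCreationYear, Month = month)

# Share of each year's listings that falls in each month (rows sum to 1).
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

If the same months carried a consistently larger share in every year, that would point to a seasonal pattern; similar shares across months would support the conclusion above.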

Listing Creation Date - Day

As with months, we cannot clearly observe a seasonal pattern across the days of each month.

Listing by Category

To complete this first analysis, I want to discover the main reason why people take out a loan. To do that, I analyze the category that borrowers selected when posting their listing. Because this variable is numeric, here is the meaning of each number: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
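The numeric codes can be turned into readable labels with a factor. The sketch below uses the code-to-label mapping listed above. The `codes` vector is a made-up sample of ListingCategory..numeric. values, and `category_labels` is a name introduced here for illustration:

```r
# Labels for codes 0 to 20, in order, taken from the list above.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

# Hypothetical sample of ListingCategory..numeric. values.
codes <- c(1, 0, 7, 1, 20, 3)

# levels gives the numeric codes, labels gives their names in the same order.
category <- factor(codes, levels = 0:20, labels = category_labels)

# Count listings per category, dropping categories that do not occur.
print(table(droplevels(category)))
```

With labeled categories, tables and plot axes show names such as "Debt Consolidation" instead of bare numbers.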

In the first graph, we can observe that the main reason for taking out a loan over the years is category 0 (Not Available); however, if we limit the y axis to the 0.95 quantile, the main columns are 0 (Not Available) and 7 (Other). We can confirm this pattern in the second graph, where I also split the data by Term. Clearly, these results are not very informative, and the data set could be improved for this variable. Considering the next result, the most important categories are 2 (Home Improvement) and 3 (Business).

Finally, what we can say about the categories is that the amount of data is fairly even across most of them.

Univariate Analysis

What is the structure of your dataset?

This data set contains 113,937 loans with 81 variables. The data set includes different kinds of variables, such as loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The variables are of different types. To get an overview of them, I used the summary function, a generic function that produces a summary of its argument; for a data frame it summarizes every column.

What is/are the main feature(s) of interest in your dataset?

The most important features are:

  • LoanOriginalAmount
  • MonthlyLoanPayment

What other features in the dataset do you think will help support your
analysis?

The supporting features are:

  • ListingCreationYear
  • ListingCreationMonth
  • ListingCreationDay
  • Term
  • ListingCategory..numeric.
  • EmploymentStatus
  • LoanStatus
  • BorrowerState

Did you create any new variables from existing variables in the dataset?

To analyze how loans behave over time, I created three columns: ListingCreationYear, ListingCreationMonth, and ListingCreationDay.

Of the features you investigated, were there any unusual distributions?

No feature has an unusual distribution.

Bivariate Plots Section

In this bivariate analysis, I look for relationships between variables that match my intuitions about the loan business. I think the following relationships are natural in this business.

LoanOriginalAmount vs MonthlyLoanPayment -> Higher loan amount, higher monthly payment.

LoanOriginalAmount vs Investors -> Higher loan amount, higher number of investors.

EstimatedReturn vs Investors -> Higher estimated return, higher investors.

LoanOriginalAmount vs EmploymentStatusDuration -> Higher loan amount, higher employment status duration.

BankcardUtilization vs LoanOriginalAmount -> Higher bankcard utilization, higher loan amount.

LoanOriginalAmount vs MonthlyLoanPayment

The correlation between these two variables is strong and positive. It means that a higher loan amount goes with a higher monthly payment.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 831.75, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9292039 0.9308148
## sample estimates:
##       cor 
## 0.9300138
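Output like the block above comes from cor.test. The sketch below runs it on synthetic data. The vectors `amount` and `payment` are simulated stand-ins for LoanOriginalAmount and MonthlyLoanPayment, and the 36-month divisor and the noise level are assumptions chosen only to produce a strong positive relationship:

```r
set.seed(1)  # make the simulated data reproducible

# Simulated loan amounts between 1,000 and 35,000.
amount <- runif(200, min = 1000, max = 35000)

# Simulated monthly payments: roughly amount / 36 plus some noise.
payment <- amount / 36 + rnorm(200, sd = 50)

# Pearson's product-moment correlation with a two-sided test.
result <- cor.test(amount, payment, method = "pearson")
print(result)

# The pieces of the printed report are also available individually.
r        <- unname(result$estimate)   # sample correlation coefficient
ci       <- result$conf.int           # 95 percent confidence interval
df       <- unname(result$parameter)  # degrees of freedom, n - 2
p_value  <- result$p.value
```

With a sample as large as this data set, the p-value is tiny even for very weak correlations, so the size of the coefficient and its confidence interval say more about the strength of a relationship than the p-value does.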

LoanOriginalAmount vs Investors

There is a correlation between these two variables, but it is not strong (the coefficient is about 0.37). A higher loan amount therefore only weakly implies a higher number of investors.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and Investors
## t = 131.61, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3665643 0.3768423
## sample estimates:
##       cor 
## 0.3717147

EstimatedReturn vs Investors

There is practically no linear correlation between these two variables: the coefficient (about -0.09) is very close to 0, even though the large sample makes it statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  EstimatedReturn and Investors
## t = -26.797, df = 84523, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09846460 -0.08509518
## sample estimates:
##         cor 
## -0.09178403

LoanOriginalAmount vs EmploymentStatusDuration

There is practically no linear correlation between these two variables: the coefficient (about 0.10) is very close to 0, even though the large sample makes it statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and EmploymentStatusDuration
## t = 31.168, df = 104190, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09009385 0.10212579
## sample estimates:
##        cor 
## 0.09611333

BankcardUtilization vs LoanOriginalAmount

There is practically no linear correlation between these two variables: the coefficient (about -0.03) is very close to 0, even though the large sample makes it statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and BankcardUtilization
## t = -11.102, df = 104210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04043413 -0.02830564
## sample estimates:
##         cor 
## -0.03437115

Because only one of these scatter plots shows a strong correlation, it is important to define another strategy for identifying relationships between variables. It is also worth noting that intuitions about a business can turn out to be false when one lacks knowledge of the field. For that reason, it is important to stay objective when analyzing data.

New Strategy

To improve the variable selection, I calculate the correlation between all numeric variables. To do this, I reduce the data set to its numeric columns and then rename the columns.

Next, I drop the NA values, and finally I calculate the correlations and show them in a graphic for easier understanding.
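These steps can be sketched in base R as follows. The data frame is a made-up miniature, the 0.7 cutoff is an assumed threshold for "strong", and the renaming and plotting steps are left out:

```r
# Hypothetical miniature with one non-numeric column and one NA.
loans <- data.frame(
  BorrowerState = c("CA", "TX", "NY", "FL", "IL", "CA"),
  LoanOriginalAmount = c(4000, 6500, 12000, 1000, 8000, NA),
  MonthlyLoanPayment = c(131, 218, 372, 40, 270, 150),
  Investors = c(2, 44, 115, 1, 80, 10)
)

# Step 1: keep only the numeric columns.
numeric_only <- loans[, sapply(loans, is.numeric)]

# Step 2: drop every row that has an NA in any remaining column.
complete <- na.omit(numeric_only)

# Step 3: pairwise Pearson correlations between all columns.
corr <- cor(complete)
print(round(corr, 3))

# Step 4: list each pair once (upper triangle) whose absolute
# correlation reaches the assumed threshold of 0.7.
idx <- which(upper.tri(corr) & abs(corr) >= 0.7, arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
print(strong)
```

Note that na.omit removes a row if any column is missing, so with many sparsely filled columns the correlations are computed on far fewer rows than the full data set. This also explains why a coefficient from the matrix can differ slightly from the one reported by a pairwise test on the same two variables.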

As a result, the pairs of variables with a notable correlation coefficient are:

    1. BorrowerRate - BorrowerAPR -> 0.989824
    2. LenderYield - BorrowerAPR -> 0.9893289
    3. LenderYield - BorrowerRate -> 0.9992113
    4. EstimatedLoss - BorrowerAPR -> 0.9495375
    5. EstimatedLoss - BorrowerRate -> 0.945297
    6. EstimatedLoss - LenderYield -> 0.9453084
    7. EstimatedReturn - EstimatedEffectiveYield -> 0.8015679
    8. ProsperRating..numeric. - BorrowerAPR -> -0.9621513
    9. ProsperRating..numeric. - BorrowerRate -> -0.9531049
    10. ProsperRating..numeric. - LenderYield -> -0.9531194
    11. ProsperRating..numeric. - EstimatedLoss -> -0.9641819
    12. CreditScoreRangeUpper - CreditScoreRangeLower -> 1
    13. OpenCreditLines - CurrentCreditLines -> 0.9604087
    14. OpenRevolvingAccounts - CurrentCreditLines -> 0.8526853
    15. OpenRevolvingAccounts - OpenCreditLines -> 0.8854484
    16. TotalInquiries - InquiriesLast6Months -> 0.7419499
    17. RevolvingCreditBalance - OpenRevolvingMonthlyPayment -> 0.7609608
    18. TotalTrades - TotalCreditLinespast7years -> 0.9364824
    19. TotalTrades - OpenCreditLines -> 0.6356007
    20. TotalProsperPaymentsBilled - TotalProsperLoans -> 0.7038035
    21. OnTimeProsperPayments - TotalProsperLoans -> 0.7029944
    22. OnTimeProsperPayments - TotalProsperPaymentsBilled -> 0.9903052
    23. MonthlyLoanPayment - LoanOriginalAmount -> 0.9319837
    24. LP_CustomerPrincipalPayments - LP_CustomerPayments -> 0.97743
    25. LP_InterestandFees - LP_CustomerPayments -> 0.6871885
    26. LP_ServiceFees - LoanOriginalAmount -> -0.483367
    27. LP_ServiceFees - MonthlyLoanPayment -> -0.4620533
    28. LP_ServiceFees - LP_CustomerPayments -> -0.7031245
    29. LP_ServiceFees - LP_CustomerPrincipalPayments -> -0.5769329
    30. LP_ServiceFees - LP_InterestandFees -> -0.8625553

Accordingly, the pairs of variables chosen for a closer look at their correlation are:

MonthlyLoanPayment - LoanOriginalAmount

The correlation between these two variables is strong and positive. It means that a higher loan amount goes with a higher monthly payment.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 831.75, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9292039 0.9308148
## sample estimates:
##       cor 
## 0.9300138

LP_ServiceFees - LoanOriginalAmount

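The correlation between these two variables is moderate and negative (about -0.47), not positive. Because LP_ServiceFees is recorded mostly as negative amounts (see the summary above), this means that a higher loan amount goes with a more negative LP_ServiceFees value, that is, with larger service fees in absolute terms.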

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and LP_ServiceFees
## t = -176.49, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4776713 -0.4684142
## sample estimates:
##        cor 
## -0.4730558

ProsperRating..numeric. - EstimatedLoss

The correlation between these two variables is strong and negative. It means that a higher ProsperRating..numeric. goes with a lower EstimatedLoss.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and EstimatedLoss
## t = -1057.6, df = 84523, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9647015 -0.9637542
## sample estimates:
##        cor 
## -0.9642309

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The list below shows the relationships I initially thought were natural. However, most of these pairs of variables turned out to be only weakly correlated, if at all.

LoanOriginalAmount vs MonthlyLoanPayment -> Higher loan amount, higher monthly payment.

LoanOriginalAmount vs Investors -> Higher loan amount, higher number of investors.

EstimatedReturn vs Investors -> Higher estimated return, higher investors.

LoanOriginalAmount vs EmploymentStatusDuration -> Higher loan amount, higher employment status duration.

BankcardUtilization vs LoanOriginalAmount -> Higher bankcard utilization, higher loan amount.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

To identify more useful relationships, I calculated the correlation coefficient between all numeric variables and found the following list:

    1. BorrowerRate vs BorrowerAPR: 0.989824
    2. LenderYield vs BorrowerAPR: 0.9893289
    3. LenderYield vs BorrowerRate: 0.9992113
    4. EstimatedLoss vs BorrowerAPR: 0.9495375
    5. EstimatedLoss vs BorrowerRate: 0.945297
    6. EstimatedLoss vs LenderYield: 0.9453084
    7. EstimatedReturn vs EstimatedEffectiveYield: 0.8015679
    8. ProsperRating..numeric. vs BorrowerAPR: -0.9621513
    9. ProsperRating..numeric. vs BorrowerRate: -0.9531049
    10. ProsperRating..numeric. vs LenderYield: -0.9531194
    11. ProsperRating..numeric. vs EstimatedLoss: -0.9641819
    12. CreditScoreRangeUpper vs CreditScoreRangeLower: 1
    13. OpenCreditLines vs CurrentCreditLines: 0.9604087
    14. OpenRevolvingAccounts vs CurrentCreditLines: 0.8526853
    15. OpenRevolvingAccounts vs OpenCreditLines: 0.8854484
    16. TotalInquiries vs InquiriesLast6Months: 0.7419499
    17. RevolvingCreditBalance vs OpenRevolvingMonthlyPayment: 0.7609608
    18. TotalTrades vs TotalCreditLinespast7years: 0.9364824
    19. TotalTrades vs OpenCreditLines: 0.6356007
    20. TotalProsperPaymentsBilled vs TotalProsperLoans: 0.7038035
    21. OnTimeProsperPayments vs TotalProsperLoans: 0.7029944
    22. OnTimeProsperPayments vs TotalProsperPaymentsBilled: 0.9903052
    23. MonthlyLoanPayment vs LoanOriginalAmount: 0.9319837
    24. LP_CustomerPrincipalPayments vs LP_CustomerPayments: 0.97743
    25. LP_InterestandFees vs LP_CustomerPayments: 0.6871885
    26. LP_ServiceFees vs LoanOriginalAmount: -0.483367
    27. LP_ServiceFees vs MonthlyLoanPayment: -0.4620533
    28. LP_ServiceFees vs LP_CustomerPayments: -0.7031245
    29. LP_ServiceFees vs LP_CustomerPrincipalPayments: -0.5769329
    30. LP_ServiceFees vs LP_InterestandFees: -0.8625553
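
A list like this can be built by computing a correlation matrix over the numeric columns and keeping only the pairs above some cutoff. The sketch below illustrates the idea on a small synthetic data frame; the data frame toy, its values, and the 0.6 cutoff are assumptions made for illustration, not the settings used to produce the list above.

```r
# Synthetic stand-in data (assumption): one strongly related pair,
# one unrelated numeric column, and one non-numeric column.
set.seed(1)
n <- 200
amount <- runif(n, 2000, 35000)
toy <- data.frame(
  LoanOriginalAmount = amount,
  MonthlyLoanPayment = amount / 36 + rnorm(n, sd = 40),
  Investors = rpois(n, 80),
  Label = sample(c("a", "b"), n, replace = TRUE)
)

# Keep only numeric columns, then correlate every pair.
# pairwise.complete.obs ignores missing values pair by pair.
numeric_cols <- toy[, sapply(toy, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")

# Take each pair once (upper triangle) and filter by absolute correlation.
idx <- which(upper.tri(cor_matrix) & abs(cor_matrix) > 0.6, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  cor = cor_matrix[idx]
)
print(strong_pairs)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal, where every variable correlates perfectly with itself.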

What was the strongest relationship you found?

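Two relationships stand out. The strongest overall is CreditScoreRangeUpper vs CreditScoreRangeLower, with a coefficient of 1, which is unsurprising because both variables are bounds of the same credit score range. Among the relationships I initially expected, the strongest is MonthlyLoanPayment vs LoanOriginalAmount, with a coefficient of 0.9319837.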

Multivariate Plots Section

In this section, I show how a third, supporting variable affects the relationships found in the bivariate analysis. The first image shows that most MonthlyLoanPayment values correspond to terms of 36 and 60 months.

Most borrowers are employed, and a considerable number of them have full-time jobs.

A large number of loans have been completed, although a considerable number are still open.

Most loans are rated AA, A, or B, which suggests that the company's portfolio is not high-risk.

Finally, the most common income range among borrowers is about $25,000 to $75,000.

The following graph shows that the mean MonthlyLoanPayment increases with EmploymentStatusDuration; however, the term does not depend on either MonthlyLoanPayment or EmploymentStatusDuration.
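
The same pattern can be checked numerically by averaging the monthly payment within bins of employment duration, split by term. The sketch below uses synthetic data; the data frame toy, the 60-month bin width, and the generated values are assumptions for illustration and do not reproduce the real figures.

```r
# Synthetic stand-in data (assumption): payment grows with employment
# duration, while the term is assigned independently of both.
set.seed(7)
n <- 300
toy <- data.frame(
  EmploymentStatusDuration = sample(0:240, n, replace = TRUE),  # months
  Term = sample(c(12, 36, 60), n, replace = TRUE)
)
toy$MonthlyLoanPayment <- 150 + 0.8 * toy$EmploymentStatusDuration +
  rnorm(n, sd = 30)

# Bucket the duration into 60-month bins so that means can be compared.
toy$DurationBin <- cut(toy$EmploymentStatusDuration,
                       breaks = c(0, 60, 120, 180, 240),
                       include.lowest = TRUE)

# Mean payment per duration bin, and per duration bin within each term.
bin_means <- tapply(toy$MonthlyLoanPayment, toy$DurationBin, mean)
mean_payment <- aggregate(MonthlyLoanPayment ~ DurationBin + Term,
                          data = toy, FUN = mean)
print(bin_means)
print(mean_payment)
```

If the term were related to the payment, the per-term rows of mean_payment would diverge within the same duration bin; in this synthetic example they stay close by construction.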

In addition, the following graph shows that the mean MonthlyLoanPayment increases in two listing categories while remaining constant in the others, and that one category has a much lower value. Again, the term does not depend on either MonthlyLoanPayment or the listing category.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

The first image shows that most MonthlyLoanPayment values correspond to terms of 36 and 60 months. Most borrowers are employed, and a considerable number of them have full-time jobs. A large number of loans have been completed, although a considerable number are still open. Most loans are rated AA, A, or B, which suggests that the company's portfolio is not high-risk. Finally, the most common income range among borrowers is about $25,000 to $75,000.

Were there any interesting or surprising interactions between features?

The mean MonthlyLoanPayment increases with EmploymentStatusDuration; however, the term does not depend on either MonthlyLoanPayment or EmploymentStatusDuration.


Final Plots and Summary

Plot One

Description One

These two maps show that a larger number of loans does not always mean that more money is requested. For example, the first map shows that the state with the most loans is California (14717 loans), whereas the state with the largest amount of money requested is Texas (9087.326 dollars). These graphs are useful because they show very clearly the difference between the number of loans and the amount of money requested. This result could also help the company decide which marketing strategy to apply. For example, if the company wants more customers, it could develop strategies for California, but if it wants to lend more money, it is probably better to focus on Texas.
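
The two quantities behind the maps can be sketched as a per-state count and a per-state amount. The example below assumes that the dollar figure is a per-loan average (the text does not say so explicitly) and uses a tiny made-up data frame; the column names BorrowerState and LoanOriginalAmount and all values are illustrative assumptions.

```r
# Tiny made-up data frame (assumption): one state has the most loans,
# another has the largest average loan amount.
toy <- data.frame(
  BorrowerState = c("CA", "CA", "CA", "CA", "TX", "TX", "NY"),
  LoanOriginalAmount = c(4000, 5000, 6000, 5000, 9000, 10000, 7000)
)

# Number of loans per state.
loans_per_state <- table(toy$BorrowerState)

# Mean loan amount per state (assumed meaning of the dollar figure).
mean_amount <- tapply(toy$LoanOriginalAmount, toy$BorrowerState, mean)

# The state leading each ranking can differ.
most_loans <- names(which.max(loans_per_state))
largest_mean <- names(which.max(mean_amount))
print(most_loans)
print(largest_mean)
```

Keeping the count and the average as separate summaries makes the contrast explicit: a state can lead on one measure without leading on the other.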

Plot Two

Description Two

This graph shows that the estimated loss increases as the Prosper rating decreases, which means that the risk is higher for borrowers with low Prosper ratings. This relationship is confirmed by the correlation coefficient between the two variables (-0.9641819). Moreover, this graph is helpful because it provides a useful way to visualise the range and other characteristics of responses for a large group.

Plot Three

Description Three

This graph shows that the monthly payment is higher when the term is shorter: it is highest when the term is 12 months, regardless of the loan category. The monthly payment is also relatively high when the term is 60 months. The main strength of this graph is how easily it shows the way three variables interact with each other.


Reflection


References